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Abstract 

We propose a new algorithm to solve sparse linear systems of equa- 
tions over the integers. This algorithm is based on a p-adic lifting tech- 
nique combined with the use of block matrices with structured blocks. It 
achieves a sub-cubic complexity in terms of machine operations subject to 
a conjecture on the effectiveness of certain sparse projections. A LinBox- 
based implementation of this algorithm is demonstrated, and emphasizes 
the practical benefits of this new method over the previous state of the 
art. 



1 Introduction 

A fundamental problem of linear algebra is to compute the unique solution of 
a non-singular system of linear equations. Aside from its importance in and of 
itself, it is key component in many recent proposed algorithms for other prob- 
lems involving exact linear systems. Among those algorithms are Diophantine 
system solving ^| E| [20] , Smith form computation [Sj |2] , and null-space and 
kernel computation In its basic form, the problem we consider is then to 
compute the unique rational vector A^^b G Q"^^ for a given non-singular ma- 
trix A £ Z"^" and right hand side & e Z"^^. In this paper we give new and 
effective techniques for when A is a sparse integer matrix, which have sub-cubic 
complexity on sparse matrices. 

* Author is currently affiliated to LP2A laboratory, University of Perpignan 
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A classical and successful approach to solving this problem for dense integer 
matrices A was introduced by Dixon in 1982 [5], following polynomial case 
studies from J2| ■ His proposed technique is to compute, iteratively, a sufficiently 
accurate p-adic approximation A~^b mod p'^ of the solution. The prime p is 
chosen such that det(A) ^ mod p (see, e.g., [32] for details on the choice of 
p). Then, using radix conversion (see e.g. [HI §12]) combined with continued 
fraction theory ^| §10], one can easily reconstruct the rational solution A~^b 
from A~^b modp*^ (see |2S| for details). 

The principal feature of Dixon's technique is the pre-computation of the 
matrix A~^ mod p which leads to a decreased cost of each lifting step. This 
leads to an algorithm with a complexity of 0~(n^ log(l|A|j + bit operations 
Here and in the rest of this paper || . . . || denotes the maximum entry in 
absolute value and the 0~ notation indicates some possibly omitting logarithmic 
factor in the variables. 

For a given non-singular matrix A G Z"^", a right hand side b E Z,"^^, and 
a suitable integer p, Dixon's scheme is the following: 

• compute B = A~^ modp; 

• compute £ p-adic digits of the approximation iteratively by multiplying B 
times the right hand side, which is updated according to each new digit; 

• use radix conversion and rational number reconstruction to recover the 
solution. 

The number £ of lifting steps required to find the exact rational solution to 
the system is ©"(n log(|| A|j + \\b\\)), and one can easily obtain the announced 
complexity (each lifting steps requires a quadratic number of bit operations in 
the dimension of A; see for more details). 

In this paper we study the case when yl is a sparse integer matrix, for 
example, when only 0~{n) entries are non-zero. The salient feature of such 
a matrix A is that applying A, or its transpose, to a dense vector c G Z"'*^ 
requires only 0~(nlog(]] A|| + \\c\\)) bit operations. 

Following techniques proposed by Wiedemann in |26| . one can compute a 
solution of a sparse linear system over a finite field in 0~{n^) field operations, 
with only 0~{n) memory. Kaltofen & Saunders studied the use of Wiede- 
mann's approach, combined with p-adic approximation, for sparse integer linear 
system. Nevertheless, this combination doesn't help to improve the bit com- 
plexity compared to Dixon's algorithm: it still requires 0'{n^) operations in the 
worst case. One of the main reasons is that Wiedemann's technique requires 
the computation, for each right hand side, of a new Krylov subspace, which re- 
quires 0{n) matrix-vector products by A mod p. This implies the requirement 
of Q{n^) operations modulo p for each lifting step, even for a sparse matrix 
(and 6(n(log \\A\\ + \\b\\)) such hfting steps are necessary in general). The only 
advantage then of using Wiedemann's technique is memory management: only 
0{n) additional memory is necessary, as compared to the O(n^) space needed 
to store matrix inverse modulo p explicitly, which may well be dense even for 
sparse A. 

The main contribution of this current paper is to provide a new Krylov-like 
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pre-computation for the p-adic algorithm with a sparse matrix which allows us 
to improve the bit complexity of linear system solving. The main idea is to use 
block-Krylov method combined with special block projections to minimize the 
cost of each lifting step. The Block Wiedemann algorithm ^ |^ would be 
a natural candidate to achieve this. However, the Block Wiedemann method 
is not obviously suited to being incorporated into a p-adic scheme. Unlike the 
scalar Wiedemann algorithm, wherein the minimal polynomial can be used for 
every right-hand side, the Block Wiedemann algorithm needs to use different 
linear combinations for each right-hand side. In particular, this is due to the 
special structure of linear combinations coming from a column of a minimal 
matrix generating polynomial (see and then be totally dependent on 

the right hand side. 

Our new scheme reduces the cost of each lifting step, on a sparse matrix as 
above, to 0~{n^'^) bit operations. This means the cost of the entire solver is 
0~(n^-^(log(|| A|| -|- ll&ID) bit operations. The algorithm makes use of the notion 
of an efficient sparse projection, for which we currently only offer a construc- 
tion which is conjectured to work in all cases. However, we do provide some 
theoretical evidence to support its applicability, and note its effectiveness in 
practice. 

Most importantly, the new algorithm is shown to offer significant practical 
improvement on sparse integer matrices. The algorithm is implemented in the 
LinBox library a generic C-I--I- library for exact linear algebra. We com- 
pare it against the best known solvers for integer linear equations, in particular 
against the Dixon lifting scheme and Chinese remaindering. We show that in 
practice it runs many times faster than previous schemes on matrices of size 
greater than 2500 x 2500 with sufSently high sparsity. This also demonstrates 
the effectiveness in practice of so-called "asymptotically fast" matrix-polynomial 
techniques, which employ fast matrix/polynomial arithmetic. We provide a de- 
tailed discussion of the implementation, and isolate the performance benefits 
and bottlenecks. A comparison with Maple dense solver emphasizes the high 
efficiency of the LinBox library and the needs of well-designed sparse solvers as 
well. 

2 Block projections 

The basis for Krylov-type linear algebra algorithms is the notion of a projec- 
tion. In Wiedemann's algorithm, for example, we solve the ancillary problem of 
finding the minimal polynomial of a matrix A G p"^" over a field F by choosing 
random u G p^^" and v € F"^^ and computing the minimal polynomial of the 
sequence uA^v for i = 0..2n — 1 (which is both easy to compute and with high 
probability equals the minimal polynomial of A) . As noted in the introduction, 
our scheme will ultimately be different, a hybrid Krylov and lifting scheme, but 
will still rely on the notion of a structured block projection. 

For the remainder of the paper, we adopt the following notation: 

• y4 G F"^" be a non-singular matrix. 
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• s be a divisor of n, the blocking factor, and 

• TO := n/s. 

Ultimately F will be Q and we will have A e Z"^", but for now we work in the 
context of a more general field F. 

For a block v € F"^* and < t < m, define 

JC{A, v):^[v\Av\--- I A^-iw ] € F"^". 

We call a triple (R,u,v) e F"^" x F*^" x F"^** an efficient block projection 
if and only if 

1. IC{AR,v) and IC{{AR)^ jU^) are non-singular; 

2. R can be applied to a vector with 0~{n) operations in F; 

3. we can compute vx, u^x, yv and yu^ for any x € F''^^ and y € F""^^", 
with 0~{n) operations in F. 

In practice we might hope that R, u and v in an efficient block projection 
are extremely simple, for example i? is a diagonal matrix and u and v have only 
n non-zero elements. 

Conjecture 2.1. For any non-singular A G F"^" and s\n, there exists an 
efficient block projection {R,u,v) G f"^"x F^^'^x F"^", and it can be constructed 
quickly. 

2.1 Constructing efficient block projections 

In what follows we present an efficient sparse projection which we conjecture 
to be effective for all matrices. We also present some supporting evidence (if 
not proof) for its theoretical effectiveness. As we shall see in Section 01 the 
projection performs extremely well in practice. 

We focus only on R and v, since its existence should imply the existence of 
a w of similar structure. 

For convenience, assume for now that all elements in v and R are alge- 
braically independent indeterminates, modulo some imposed structure. This is 
sufficient, since the existence of an efficient sparse projection with indeterminate 
entries would imply that a specialization to an effective sparse projection over 
Zp is guaranteed to work with high probability, for sufficiently large p. We also 
consider some different possibilities for choosing R and v. 

2.1.1 Dense Projections 

The "usual" scheme for block matrix algorithms is to choose R diagonal, and v 
dense. The argument to show this works has several steps. First, AR will have 
distinct eigenvalues and thus will be non-derogatory (i.e., its minimal polynomial 
equals its characteristic polynomial). See [2], Lemma 4.1. Second, for any 
non-derogatory matrix B and dense v we have lC{B,v) non-singular (see jlSp. 
However, a dense v is not an efficient block projection since condition (2) is not 
satisfied. 
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2.1.2 Structured Projections 



The following projection scheme is the one we use in practice. Its effectiveness 
in implementation is demonstrated in Section^ 
Choose R diagonal as before. Choose 



(1) 



with each * of dimension m x 1. The intuition behind the structure of v is 
twofold. First, if s = 1 then i> is a dense column vector, and we know K,{AR, v) 
is non-singular in this case. Second, since the case s = 1 requires only n nonzero 
elements in the "block" , it seems that n nonzero elements should suffice in the 
case s > 1 also. Third, if _E is a diagonal matrix with distinct eigenvalues then, 
up to a permutation of the columns, )C{E,v) is a block Vandermonde matrix, 
each mxm block defined via m distinct roots, thus non-singular. In the general 
case with s > 1 we ask: 



Question 2.2. For R diagonal and v as in f7)) . 

singular? 



IC{AR,v) necessarily 



Our work thus far has not led to a resolution of the question. However, 
by focusing on the case s = 2 we have answered the following similar question 
negatively: If A is nonsingular with distinct eigenvalues and w is as in is 
IC{A,v) necessarily nonsingular? 

Lemma 2.3. If m — 2 there exists a nonsingular A with distinct eigenvalues 
such that for V as in the matrix lC{A,v) is singular. 



Proof. We give a counterexample with n — A. Let 
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For the generic block 



ai 
a2 



bi 
b2 

the matrix /C(A, v) is singular. By embedding A into a larger block diagonal 
matrix we can construct a similar counterexample for any n and m = 2. □ 

Thus, if Question 12 . 21 has an affirmative answer, then proving it will neces- 
sitate considering the effect of the diagonal preconditioner R above and beyond 
the fact that "AR has distinct eigenvalues". For example, are the eigenvalues 
of AR algebraically independent, using the fact that entries in R are? This may 
already be sufficient. 



2.1.3 A Positive Result for the Case s = 2 

For s = 2 we can prove the effectiveness of our efficient sparse projection scheme. 

Suppose that A G P"^" where n is even and A is diagonalizable with distinct 
eigenvalues in an extension of F. Then A = X~^DX E p"^" for some diagonal 
matrix D with distinct diagonal entries (in this extension). Note that the rows 
of X can be permuted (replacing X with PX for some permutation P), 

A ^ {{PX)-\P-^DP){PX)), 

and P^^DP is also a diagonal matrix with distinct diagonal entries. Conse- 
quently we may assume without loss of generality that the top left (n/2) x (n/2) 
submatrix Xi i of X is nonsingular. Suppose that 



X = 

and consider the decomposition 



X2.I X2,2 



where 



Z ^ 



x; 



A ^ Z-^AZ, 
' " X^ 



(2) 



/ Zu2 

Z2.1 Z2.2 



x^i 

for n/2 X n/2 matrices Zi^2, ^2,1, and 2^2.2, and where 



A^ 



X 







1.1 

x^} 



D 



Xi,i 




so that 



A^ 



Ai 
A2 
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for matrices Ai and A2. The matrices Ai and A2 are each diagonahzable over 
an extension of F, since A is, and the eigenvalues of these matrices are also 
distinct. 

Notice that, for vectors a, 6 with dimension n/2, and for any nonnegative 
integer i, 



A' 

Thus, if 



Z-^A' 



a 

^2.1 a 



a 

Z2Aa 



and A"- 



22,2^ 



and 



y ■ 



Zi,2b 

Z2,2b 



then the matrix with columns 

a, Aa, A^a, A^/^'^a, b, Ah, A^b, A^-^-ifo 
is nonsingular if and only if the matrix with columns 



X , Ax , A X , 



,A-^'-'x,y,Ay,A\...,A-/^-' 



y 



is nonsingular. The latter condition fails if and only if there exist polynomials / 
and g, each with degree less than n/2, such that at least one of these polynomials 
is nonzero and 

f{A)x + g{A)y = 0. (3) 

To proceed, we should therefore determine a condition on A ensuring that no 
such polynomials / and g exist for some choice of x and y (that is, for some 
choice of a and b). 

A suitable condition on A is easily described: We will require that the top 
right submatrix Zi 2 of Z is nonsingular. 

Now suppose that the entries of the vector b are uniformly and randomly 
chosen from some (sufficiently large) subset of F, and suppose that a = —Zi 2b. 
Notice that at least one of / and g is nonzero if and only if at least one of / 
and g — f is nonzero. Furthermore, 

f{A){x) + g{A)iy) = /(I)(x + y) + {g - f)my). 

It follows by the choice of a that 



y 





(^2,2 ^ Z2,lZi_2)b 



Since A is block diagonal, the top n/2 entries of f{A){x + y) arc nonzero as well 
for every polynomial /. Consequently, failure condition (PJ can only be satisfied 
if the top n/2 entries of the vector {g — f){A){y) are also all zero. 

Recall that g — f has degree less than n/2 and that the top left submatrix 
of the block diagonal matrix A is diagonahzable with n/2 distinct eigenvalues. 
Assuming, as noted above, that Z12 is nonsingular (and recalling that the top 
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half of the vector y is Zi^2b), the Schwartz-Zippel lemma is easily used to show 
that if b is randomly chosen as described then, with high probability, the failure 
condition can only be satisfied if 5 — / = 0. That is, it can only be satisfied if 
f = 9- 

Observe next that, in this case, 

f{A){x)+g{A){y) = f{A){x + y), 

and recall that the bottom half of the vector x+y is the vector {Zi,2 — 'Zi2.\Z\,2)h . 
The matrix Z2.2 — -^2, 1^^1,2 is clearly nonsingular (it is a Schur complement 
formed from so, once again, the Schwartz-Zippel lemma can be used to show 
that if h is randomly chosen as described above then /(A) (a; + y) = if and 
only if / = as well. 

Thus if Z\ 2 is nonsingular and a and h are chosen as described above then, 
with high probability, equation © is satisfied only if / = g 0. There must 
therefore exist a choice of a and h providing an efficient block projection — once 
again, supposing that Z\^2 is nonsingular. 

It remains only to describe a simple and efficient randomization of A that 
achieves this condition with high probability: Let us replace A with the matrix 



A 



I tl 
/ 



A 



I tl 
/ 



/ -tl 
/ 



A 



I tl 
/ 



where t is chosen uniformly from a sufficiently large subset of F. This has the 
effect of replacing Z with the matrix 



Z 



(see, again, Q), effectively replacing Z1.2 with Zi^2 + tl. There are clearly at 
most n/2 choices of t for which the latter matrix is singular. 
Finally, note that if w is a vector and i > then 



" / 


tl ' 




I Z^,2 


+ t/ 





I 




^2.1 Z2.2 + 


tZ2.X 



A'v = 



/ -tl 
/ 



A' 



I tl 
/ 



It follows by this and similar observations that this randomization can be applied 
without increasing the asymptotic cost of the algorithm described in this paper. 

Question: Can the above randomization and proof be generalized to a similar 
result for larger s? 



Other sparse block projections 

Other possible projections are summarized as follows. 

• Iterative Choice Projection. Instead of choosing v all at once, choose 
the columns oi v — [t;i|t;2| • • • \vs\ in succession. For example, suppose up 
to preconditioning we can assume we are working with a _B G F"^" that 
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is simple as well as has the property that the characteristic polynomial 
is irreducible. Then we can choose vi to be the first column of /„ to 
achieve IC{B, vi) e F"^™ of rank m. Next choose V2 to have two nonzero 
entries, locations chosen randomly until [1C{B , vi)\]C{B , V2)] & F"^^™ has 
rank 2m, etc. This gives a v with m(TO + 2)/2 nonzero entries. 

The point of choosing v column by column is that, while choosing all of v 
sparse may have a very small probability of success, the success rate for 
choosing Vi when vi,V2, ■ ■ ■ , fi-i are already chosen may be high enought 
(e.g., maybe only expected O(logn)) choices for Vi before success). 

• Toeplitz projections. Choose R and/or v to have a Toeplitz structure. 

• Vandermonde projections. Choose v to have a Vandermonde or a 
Vandermonde-like structure. 



3 Non-singular sparse solver 

In this section we show how to employ a block-Krylov type method combined 
with the (conjectured) efficient block projections of Section [21 to improve the 
complexity of evaluating the inverse modulo p of a sparse matrix. Applying 
Dixon's p-adic scheme with such an inverse yields an algorithm with better 
complexity than previous methods for sparse matrices, i.e., those with a fast 
matrix-vector product. In particular, we express the cost of our algorithm in 
terms of the number of applications of the input matrix to a vector, plus the 
number of auxiliary operations. 

More precisely, given A G Z"^" and v G Z"^^, let /i(n) be the number of op- 
erations in Z to compute Av or v^A. Then, assuming Coniecture l2.1l our algo- 
rithm requires 
0~(n^-^ (log( II A|| -I- ||6||)) matrix-vector products w 1-^ Aw on vectors w G Z"^^ 
with ||w|| = 0(1), plus 0''(n^-^(log(|| A|| -I- ||6||)) additional bit operations. 

Summarizing this for practical purposes, in the common case of a matrix A G 
•pixn ^[^^ 0~{n) constant-sized non-zero entries, and b G Z"^^ with constant- 
sized entries, we can compute A'~^b with 0~{n^'^) bit operations. 

We achieve this by first introducing a structured inverse of the matrix Ap = 
A mod p which links the problem to block-Hankel matrix theory. We will assume 
that we have an efficient block projection {R,u,v) G Z^^" x Z^^" x Zp^'* for 
Ap, and let B = AR G Z^^". We thus assume we can evaluate Bw and tiF B, 
for any w G Z^^^, with 0~(/i(n)) operations in Zp. The proof of the following 
lemma is left to the reader. 

Lemma 3.1. Let B G Z^^" be non-singular, where n = ms for m,s G Z>o. 
Let u G "Zp^" and v G Z^^* be efficient block projections such that V = 
[v\Bv\ ■ ■ ■ |B™-iw] G l^""" and C/^ = [u^\B^u^\ ■ ■ ■ \{B^)"'-^u^] G Z^>^" are 
non-singular. The matrix H = UBV G Z^^" is then a block-Hankel matrix, 
and the inverse for B can be written as B^^ — VH^^U. 
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In fact 



H = 



a2 as • • • a„ 



e (4) 



with Ui = uB^v G Z^^* for i = 1 . . . 2m — 1. H can thus be computed with 
2m — 1 applications of to a (block) vector plus 2m — 1 pre-multiplications by 
u, for a total cost of 2n/i(n) + 0~(n^) operations in Zp. For a word-sized prime 
p, we can find H with 0~{nfi{n)) bit operations (where, by "word-sized", we 
mean having a constant number of bits, typically 32 or 64, depending upon the 
register size of the target machine). 

We will need to apply H^^ to a number of vectors at each lifting step and 
so require that this be done efhciently. We will do this by fist representing 
using the off-diagonal inverse formula of .17; : 




where at, a* , Pi, (3* G Z^^". 

This representation can be computed using the Sigma Basis algorithm of 
Beckermann-Labahn ^7]. We use the version given in UT] which ensures the 
desired complexity in all cases. This requires 0~{s^m) operations in Zp (and 
will only be done once during the algorithm, as pre-computation to the lifting 
steps). 

The Toeplitz/Hankel forms of the components in this formula allow to eval- 
uate H^^w for any w G Zp'^^ with ©"(s^m) or (Tins) operations in Zp using 
an FFT-based polynomial multiplication (see 1 ). An alternative to computing 
the inversion formula would be to use the generalization of the Levinson-Durbin 
algorithm in 14 . 

Corollary 3.2. Assume that we have pre-computed H^^ G Zp^" for a word- 
sized prime p. Then, for any v G Zp^^, we can compute B^^v mod p with 
2(m — l)/x(n) -|- 0~{n{m + s)) operations in 1^. 

Proof. By Lemma iH.ll we can express the application of B^^ to a vector by an 
application of J7, followed by an application of followed by an application 
of V^. 

To apply [/ to a vector G Zp^^, we note that 

{Uwf = \{uwf, {uBwf, [uB^'-^fwYf. 

We can find this iteratively, for i = 0, . . . , m— 1, by computing hi — B^w — Bbi-i 
(assume bo — w) and uB'w — ubi, for i = 0..m — 1 in sequence. This requires 
(m — l)/i(n) + 0~{mn) operations in Zp. 
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To apply y to a vector ?/ € Z^^\ write y = [yolyil • • • \ym-iV , where G Z^. 
Then 

V'y = vya + Bt;yi + B'^vy2 + ■■■ + B™-iwy,„_i 

= rao + B {vxi + i? {vxi H ((wa;m_2 + Bvxm-i) •■■))) 

which can be accomphshed with m — 1 applications of B and m applications of 
the projection v. This requires (m — l)/i(n) + 0~{mn) operations in Zp. □ 

P-adic scheme 

We employ the inverse computation described above in the p-adic lifting algo- 
rithm of Dixon We briefly describe the method here and demonstrate its 
complexity in our setting. 

Input: A e Z"^" non-singular, b £ Z"^i; 
Output: A-^be Q"^i 

(1) Choose a prime p such that det A ^ mod p; 

(2) Determine an efficient block projection for A: 
R,u,ve Z"^" X Z;^" X Z;j^"; Let B = AR; 

(3) Compute = for i = 1 . . . 2m — 1 and define as in Q. Recall 

that B-^ = VH-^U; 

(4) Compute the inverse formula of (see above); 

(5) Let £ := f . \\og^{n\\Ar) + logp((n - + 

6o := 6; 

(6) For i from to do 

(7) Xi :~ B^^bi mod p; 

(8) b,+i:=p-^{b,- Bx^) 

(9) Reconstruct x e Q"^^ from a;^ using rational reconstruction. 

Theorem 3.3. The above p-adic scheme solves the linear system A^^b with 
0"'(n"'^'^(log(|| A|| 4- |j6||)) matrix-vector products by Amodp (for a machines- 
word sized prime p) plus 0"'(n^'^(log(|| A|| -|- ||&||)) additional bit- operations. 

Proof. The total cost of the algorithm is 0~{nii{n) + n'^ + nlog{\\A\\ -I- ||6||)(m/z-|- 
n{m -\- s)). For the optimal choice of s = ^/n and m = n/s, this is easily seen 
to equal the stated cost. The rational reconstruction in the last step is easily 
accomplished using radix conversion (see, e.g., combined with continued 
fraction theory, in a cost which is dominated by the other operations (see 
for details). □ 
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4 Efficient implementation 



An implementation of our algorithm has been done in the LmBoxhbrary [H]. 
This is a generic CH — h hbrary which offers both high performance and the 
flexibihty to use highly tuned libraries for critical components. The use of 
hybrid dense linear algebra routines based on fast numerical routine such 
as BLAS, is one of the successes of the library. Introducing blocks to solve 
integer sparse linear systems is then an advantage since it allows us to use such 
fast dense routines. One can see in Section that this becomes necessary to 
achieve high performance, even for sparse matrices. 

4.1 Optimizations 

In order to achieve the announced complexity we need to use asymptotically fast 
algorithms, in particular to deal with polynomial arithmetic. One of the main 
concerns is then the computation of the inverse of the block-Hankel matrix and 
the matrix- vector products with the block-Hankcl/Toeplitz matrix. 

Consider the block-Hankel matrix H £ Z^^" defined by 2m — 1 blocks of 
dimension s denoted in equation Q. Let us denote the matrix power series 

H{z) ^ ai + a2Z + . . . + a2m^iz^"'~'^ . 

One can compute the off-diagonal inverse formula of H using 17 theorem 3.1] 
with the computation of 

• two left sigma bases of [H{zy \ I]'^ of degrees 2m — 2 and 2m, and 

• two right sigma bases of [H{z) \ I] of degrees 2m — 2 and 2m. 

This computation can be done with 0~{s'^m) field operation with the fast 
algorithm PM-Basis of UTj. However, the use of a slower algorithm such as 
M-Basis of JT, will give a complexity of O(s'^m^) or 0{n^s) field operations. In 
theory, the latter is not a problem since the optimal s is equal to y^, and thus 
gives a complexity of 0{n?'^) field operations, which still yields the announced 
complexity. 

In practice, we developed implementations for both algorithms {M-Basis 
and PM-Basis) , using the efficient dense linear algebra of and an FFT-based 
polynomial matrix multiplication. Nevertheless, due to the special structure 
of the series to approximate, the use of a third implementation based on a 
modified version of M-Basis, where only half of the first columns (or rows) of 
the basis are computed, allows us to achieve the best performance. Note that 
the approximation degrees remain small (less than 1 000) . 

Another important point in our algorithm is the application of the off di- 
agonal inverse to a vector x G Z^^^. This computation reduces to polynomial 
matrix- vector product; x is cut into chunks of size s. Contrary to the block- 
Hankel matrix inverse computation, we really need to use fast polynomial arith- 
metic to achieve our complexity. However, we can avoid the use of FFT-based 
arithmetic since the evaluation of H^^, which is the dominant cost, can be done 
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only once at the beginning of the hfting. Let t = 0{rn) be the number of evalu- 
ation points. One can evaluate at t points using Horner's rules with O(n^) 
field operations. 

Hence, applying in each lifting step reduces to the evaluation of a vector 
y e 1p[xY^^ of degree m at t points, to computing t matrix- vector product of 
dimension s, and to interpolating the result. The cost for each application of 
is then 0{m?s + ras^) field operations, giving 0{ln}^^) field operations for 
the optimal choice of s = to = ^Jn. This cost is deduced easily from Horner's 
evaluation and Lagrange's interpolation. 

To achieve better performances in practice, we use a Vandermonde matrix 
and its inverse to perform the evaluation/interpolation steps. This allows us to 
maintain the announced complexity, and to benefit from the fast dense linear 
algebra routine of LinBox library. 

4.2 Timings 

We now compare the performance of our new algorithm against the best known 
solvers. As noted earlier, the previously best known complexity for algorithms 
solving integer linear systems is 0"'(n'^ log(| |A| | -1- ||6||)) bit operations, indepen- 
dent of their sparsity. This can be achieved with several algorithms: Wiede- 
mann's technique combined with the Chinese remainder algorithm [21], Wiede- 
mann's technique combined withp-adic lifting '16', or Dixon's algorithm ^ . All 
of these algorithms are implemented within the LinBox library and we ensure 
they benefits from the optimized code and libraries to the greatest extent possi- 
ble. In our comparison, we refer to these algorithms by respectively: CRA-Wied, 
P-adic-Wied and Dixon. In order to give a timing reference, we also compare 
against the dense Maple solver. Note that algorithm used by Maple 10 has a 
quartic complexity in matrix dimension. 

In the following, matrices are chosen randomly sparse, with fixed or variable 
sparsity, and some non-zero diagonal elements are added in order to ensure the 
non-singularity. 





400 


900 


1600 


2500 


3600 






Maple 


64.7s 


849s 


11098s 








CRA-Wicd 


14.8s 


168s 


1017s 


3857s 


11452s 




P-adic-Wied 


10.2s 


113s 


693s 


2629s 


8034s 




Dixon 


0.9s 


10s 


42s 


178s 


429s 




Our algo. 


2.4s 


15s 


61s 


175s 


426s 



Table 1: Solving sparse integer linear system (10 non-zero elts per row) on a 
Itanium2, 1.3GHz 

First, one can see from Tabled that even if most of the algorithms have 
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the same complexity, their performance varies widely. The P-adic-Wied imple- 
mentation is a bit faster than CRA-Wied since the matrix reduction modulo 
a prime number and the minimal polynomial computation is done only once, 
contrary to the 0~{ri) times needed by CRA. Another important feature of this 
table is to show the efficiency of dense LinBox 's routines compared to sparse 
routines. One can notice the improvement by a factor 10 to 20 with Dixon. An 
important point to note is that 0{n) sparse matrix- vector products is not as fast 
in practice as one dense matrix- vector product. Our new algorithm completely 
benefits from this remark and allows it to achieve similar performances to Dixon 
on smaller matrices, and to outperform it for larger matrices. 

In order to emphasize the asymptotic benefit of our new algorithm, we now 
compare it on larger matrices with different levels of sparsity. In Figure ^ we 
study the behaviour of our algorithm compared to that of Dixon with fixed 
sparsity (10 and 30 non-zero elements per rows). The goal is to conserve a fixed 
exponent in the complexity of our algorithm. 

Timings comparison Dixon Vs Our algo. for integer iinear system soiving 

60000 



50000 
40000 

CO 
■D 

c 
o 

D 

g 30000 

g 
E 
i- 

20000 
10000 


6000 8000 10000 12000 14000 

Matrix dimension 

Figure 1: Comparing our algo. with Dixon's algorithm (fixed sparsity) on a 
Itamum2, l.SGHz 

With 10 non-zero element per row, our algorithm is always faster than 
Dixon's and the gain tends to increase with matrix dimension. Its not exactly 
the same behaviour when matrices have 30 non-zero clement per row. For small 
matrices, Dixon still outperforms our algorithm. The crossover appears only 
after dimension 10 000. This phenomenon is explained by the fact that sparse 
matrix operations remain too costly compared to dense ones until matrix di- 
mensions become sufficiently large that the overall asymptotic complexity plays 
a more important role. 

14 




This explanation is verified in Figure 12 where different sparsity percentages 
are used. The sparser the matrices are, the earlier the crossover appears. For 
instance, with a sparsity of 0.07%, our algorithm becomes more efficient than 
Dixon's for matrices dimension greater than 1600, while this is only true for 
dimension greater than 2500 with a sparsity of 1%. Another phenomenon when 
examining matrices of a fixed percentage density is emphasized by the Figure 
12 This is because Dixon's algorithm again becomes the most efficient, in this 
case, when the matrices become large. This is explained by the variable sparsity 
which leads to a variable complexity. For a given sparsity, the larger the matrix 
dimensions the more non-zero entries per row, and the more costly our algorithm 
is. As an example, with 1% of non zero element, the complexity is doubled from 
matrix dimension n = 3 000 to n = 6 000. As a consequence, the performances 
of our algorithm drop with matrix dimension in this particular case. 



Timings ratio of our algo. /Dixon for infeger linear system solving 




3000 4000 
Matrix dimension 



Figure 2: Gain of our algo. from Dixon's algorithm (variable sparsity) on a 
Itanium2, 1.3GHz 



4.3 The practical effect of different blocking factors 

In order to achieve even better performance, one can try to use different block 
dimensions rather than the theoretical optimal \/n. The Table [3 studies exper- 
imental blocking factors for matrices of dimension n = 10 000 and n = 20 000 
with a fixed sparsity of 10 non-zero elements per rows. 

One notices that the best experimental blocking factors are far from the 
optimal theoretical ones (e.g., the best blocking factor is 400 when n = 10 000 
whereas theoretically it is 100). This behaviour is not surprising since the larger 
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n= 10 000 



block size 


80 


125 


200 


400 


500 


timing 


7213s 


5264s 


4059s 


3833s 


4332s 



n= 20 000 



block size 


125 


160 


200 


500 


800 


timing 


44720s 


35967s 


30854s 


28502s 


37318s 



Table 2: Blocking factor impact (sparsity= 10 alts per row) on a Itanium2, 
1.3GHz 



the blocking factor is, the fewer sparse matrix operations and the more dense 
matrix operations are performed. As we already noted earlier, operations are 
performed more efficiently when they are dense rather than sparse (the cache 
effect is of great importance in practice). However, as shown in Table |21 if the 
block dimensions become too large, the overall complexity of the algorithm in- 
creases and then becomes too important compared to Dixon's. A function which 
should give a good approximation of the best practical blocking factor would 
be based on the practical efficiency of sparse matrix-vector product and dense 
matrix operations. Minimizing the complexity according to this efficiency would 
lead to a good candidate blocking factor. This could be done automatically at 
the beginning of the lifting by checking efficiency of sparse matrix-vector and 
dense operation for the given matrix. 

Concluding remarks 

We give a new approach to solving sparse linear algebra problems over the inte- 
gers by using sparse or structured block projections. The algorithm we exhibit 
works well in practice. We demonstrate it on a collection of very large ma- 
trices and compare it against other state-of-the art algorithms. Its theoretical 
complexity is sub-cubic in terms of bit complexity, though it rests still on a 
conjecture which is not proven in the general case. We offer a rigorous treat- 
ment for a small blocking factor (2) and provide some support for the general 
construction. 

The use of a block-Krylov-like algorithm allows us to link the problem of 
solving sparse integer linear systems to polynomial linear algebra, where we can 
benefit from both theoretical advances in this field and from the efficiency of 
dense linear algebra libraries. In particular, our experiments point out a general 
efRciency issue of sparse linear algebra: in practice, are (many) sparse operations 
as fast as (correspondingly fewer) dense operations? We have tried to show 
in this paper a negative answer to this question. Therefore, our approach to 
providing efficient implementations for sparse linear algebra problems has been 
to reduce most of the operations to dense linear algebra on a smaller scale. This 
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work demonstrates an initial success for this approach (for integer matrices), 
and it certainly emphasizes the importance of well-designed (both theoretically 
and practically) sparse, symbolic linear algebra algorithms. 
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